!pip install auto-sklearn -q
!pip install openml -q
!pip install PipelineProfiler -q
!pip install h2o -q
!pip install gama -q
!pip install transformers -q
!pip install datasets -q
# restart runtime after installing
Standing on the shoulders of giants. This week we will be using very large models trained on huge datasets. With transfer learning you can fit these pre-trained models to your own data, which saves you a lot of time and processing power. Furthermore, we'll walk you through some extremely useful AutoML libraries; knowing how to use them can take a lot of work out of your hands. Let's start with some AutoML!
# First we are going to take a look at the autosklearn library
# Restart runtime if it crashes
import autosklearn.classification
import sklearn.model_selection
import sklearn.datasets
import sklearn.metrics
import PipelineProfiler
# load the data set and split
X, y = sklearn.datasets.load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = \
sklearn.model_selection.train_test_split(X, y, random_state=1)
A) Have a look at the shape of the data
B) What is each number in an instance supposed to mean?
C) Look up https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html
D) Create a plot that visualises an instance of X
print(X[0])
digits = sklearn.datasets.load_digits()
print(digits.images[0])
import matplotlib.pyplot as plt
plt.gray()
plt.matshow(digits.images[0])
plt.show()
print('Label:',y[0])
Alright, now we are going to train a classifier using auto-sklearn to classify each digit. Run the following code. It should take around a minute, and you should get an accuracy above 0.98.
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=60,  # sec., total time budget for the whole AutoML search
    per_run_time_limit=15,       # sec., each candidate model may only take this long before it is killed
)
automl.fit(X_train, y_train)
y_hat = automl.predict(X_test)
print("Accuracy score", sklearn.metrics.accuracy_score(y_test, y_hat))
Okay... Looks cool! But how do we know what our model looks like, and on which parameters it is tuned?
A) Look up in the manual how to inspect your results https://automl.github.io/auto-sklearn/master/manual.html
B) Which model performs best?
C) Find the parameters set for your model
D) Look up PipelineProfiler and run it to inspect your trained autoML
# Print the final ensemble constructed by auto-sklearn.
print(automl.show_models())
predictions = automl.predict(X_test)
# Print statistics about the auto-sklearn run such as number of
# iterations, number of models failed with a time out.
print(automl.sprint_statistics())
print("Accuracy score", sklearn.metrics.accuracy_score(y_test, predictions))
profiler_data = PipelineProfiler.import_autosklearn(automl)
PipelineProfiler.plot_pipeline_matrix(profiler_data)
Below you find a classification example using H2O AutoML. You can run the code to generate the results, but this can take up to 30 minutes.
import h2o
from h2o.automl import H2OAutoML
# Start the H2O cluster (locally)
h2o.init()
# Import a sample binary outcome train/test set into H2O
train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
test = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")
# Identify predictors and response
x = train.columns
y = "response"
x.remove(y)
# For binary classification, response should be a factor
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()
# Run AutoML for 10 base models
aml = H2OAutoML(max_models=10, seed=1)
aml.train(x=x, y=y, training_frame=train)
# View the AutoML Leaderboard
lb = aml.leaderboard
lb.head(rows=lb.nrows) # Print all rows instead of default (10 rows)
# The leader model is stored here
aml.leader
We'll use the Combined Cycle Power Plant dataset. The goal is to predict the energy output (in megawatts), given the temperature, ambient pressure, relative humidity and exhaust vacuum values. In this demo, you will use H2O's AutoML to try to outperform state-of-the-art results on this task.
import os
data_path = "https://github.com/h2oai/h2o-tutorials/raw/master/h2o-world-2017/automl/data/powerplant_output.csv"
# Load data into H2O
df = h2o.import_file(data_path)
A) Use .describe() to get a sense of what the data looks like
B) Split your dataframe in 80% train and 20% test
C) Run H2OAutoML on your dataframe and set 'HourlyEnergyOutputMW' as y. Tip: Set max_runtime_secs to 120 to avoid training for a really long time.
D) Check the leaderboard to see what model performed best.
df.describe()
splits = df.split_frame(ratios = [0.8], seed = 1)
train = splits[0]
test = splits[1]
aml = H2OAutoML(max_runtime_secs = 60, seed = 1, project_name = "powerplant_lb_frame")
aml.train(y = 'HourlyEnergyOutputMW', training_frame = train, leaderboard_frame = test)
aml.leaderboard.head()
A) Use the model you trained in exercise 3 to predict the HourlyEnergyOutputMW in your test set.
B) H2O has a function called model_performance to evaluate your model on test data. Use it to see how well your model predicts the test data.
pred = aml.predict(test)
perf = aml.leader.model_performance(test)
perf
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss, accuracy_score
from gama import GamaClassifier
if __name__ == "__main__":
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, random_state=0
    )

    automl = GamaClassifier(max_total_time=180, store="nothing", n_jobs=1)
    print("Starting `fit` which will take roughly 3 minutes.")
    automl.fit(X_train, y_train)

    label_predictions = automl.predict(X_test)
    probability_predictions = automl.predict_proba(X_test)

    print("accuracy:", accuracy_score(y_test, label_predictions))
    print("log loss:", log_loss(y_test, probability_predictions))
A) Take the Combined Cycle Power Plant dataset from the previous assignment and the Digits dataset from the first exercise, and use GAMA with the same prediction goals.
B) Evaluate the difference in the resulting metrics, but especially in pipeline structure. Which is your favorite AutoML library?
First we will have a look at the Hugging Face library. This is a library full of large pretrained models that are easy to install and use. Hugging Face offers a large number of NLP (Natural Language Processing) models, but also models for audio and vision processing. Check out their site for all the available models: https://huggingface.co/models
Below we give an example of a model named "bert-base-NER". bert-base-NER is a fine-tuned BERT model that is ready to use for Named Entity Recognition (NER) and achieves state-of-the-art performance on that task. It has been trained to recognize four types of entities: location (LOC), organization (ORG), person (PER) and miscellaneous (MISC).
For more info on named entity recognition you can check out this paper https://aclanthology.org/W03-0419.pdf
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "I'm Dorian from Utrecht and I work for Fruitpunch AI."
ner_results = nlp(example)
print(ner_results)
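The pipeline returns a list of dicts, one per recognized word piece, with B-/I- prefixed entity tags. As a minimal, self-contained sketch of how you could post-process that output, here we merge tagged word pieces into entity spans; the sample input below is illustrative, in the shape the pipeline typically returns, not actual model output:

```python
# Illustrative sample in the shape a "ner" pipeline returns (values are made up)
sample_results = [
    {"word": "Dorian", "entity": "B-PER", "score": 0.99},
    {"word": "Utrecht", "entity": "B-LOC", "score": 0.99},
    {"word": "Fruit", "entity": "B-ORG", "score": 0.98},
    {"word": "##punch", "entity": "I-ORG", "score": 0.97},
]

def group_entities(results):
    """Merge B-/I- tagged word pieces into (text, label) spans."""
    entities = []
    for r in results:
        tag, label = r["entity"].split("-", 1)
        word = r["word"].replace("##", "")  # strip WordPiece continuation marker
        if tag == "B" or not entities or entities[-1][1] != label:
            entities.append([word, label])
        else:  # continuation of the previous entity
            entities[-1][0] += word
    return [tuple(e) for e in entities]

print(group_entities(sample_results))
# → [('Dorian', 'PER'), ('Utrecht', 'LOC'), ('Fruitpunch', 'ORG')]
```

Recent versions of transformers can do this grouping for you via the pipeline's `aggregation_strategy` argument, if your installed version supports it.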
A) Load the reviews dataset from one of our previous ai-code sessions.
B) The code below shows how to build a text classifier from scratch. Run the code and check the highest reached accuracy. (This can take ~5 minutes.)
C) Search the Hugging Face hub for a model that can classify the reviews in this dataset as positive or negative, and run it on the data.
D) Evaluate your transferred model. Does it outperform the models built from scratch?
!git clone https://github.com/fruitpunch-ai-code/epoch-14.git
#required libraries
import pandas as pd
import numpy as np
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from collections import defaultdict
from nltk.corpus import wordnet as wn
from sklearn import model_selection
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
#Set Random seed
np.random.seed(500)
df = pd.read_csv('/content/epoch-14/Challenges/reviews.csv', encoding='latin-1')
df.head()
# Step 1: Data Pre-processing - This will help in getting better results through the classification algorithms
Corpus = df.copy()
# Step 1a : Remove blank rows if any.
Corpus['text'].dropna(inplace=True)
# Step - 1b : Change all the text to lower case. This is required as python interprets 'dog' and 'DOG' differently
Corpus['text'] = [entry.lower() for entry in Corpus['text']]
# Step - 1c : Tokenization : In this each entry in the corpus will be broken into set of words
Corpus['text']= [word_tokenize(entry) for entry in Corpus['text']]
# Step - 1d : Remove stop words and non-alphabetic tokens, and perform word lemmatization.
# WordNetLemmatizer requires POS tags to understand whether the word is a noun, verb, adjective, etc. By default it is set to noun.
tag_map = defaultdict(lambda : wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV
for index, entry in enumerate(Corpus['text']):
    # Declaring empty list to store the words that follow the rules for this step
    Final_words = []
    # Initializing WordNetLemmatizer()
    word_Lemmatized = WordNetLemmatizer()
    # pos_tag provides the 'tag', i.e. whether the word is a noun (N), verb (V) or something else
    for word, tag in pos_tag(entry):
        # Skip stop words and consider only alphabetic tokens
        if word not in stopwords.words('english') and word.isalpha():
            word_Final = word_Lemmatized.lemmatize(word, tag_map[tag[0]])
            Final_words.append(word_Final)
    # The final processed set of words for each entry is stored in 'text_final'
    Corpus.loc[index, 'text_final'] = str(Final_words)
# Step 2: Split the model into Train and Test Data set
Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(Corpus['text_final'],Corpus['label'],test_size=0.3)
# Step 3: Label encode the target variable - This is done to transform Categorical data of string type in the data set into numerical values
from sklearn.preprocessing import LabelEncoder
Encoder = LabelEncoder()
Train_Y = Encoder.fit_transform(Train_Y)
Test_Y = Encoder.transform(Test_Y)  # use transform (not fit_transform) so test labels get the same encoding as train
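A tiny self-contained illustration of what `LabelEncoder` does (toy labels, not the review data): it maps each distinct string label to an integer, with classes sorted alphabetically.

```python
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
codes = enc.fit_transform(["neg", "pos", "pos", "neg"])
print(list(enc.classes_))  # → ['neg', 'pos']
print(list(codes))         # → [0, 1, 1, 0]
```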
# Step 4: Vectorize the words by using TF-IDF Vectorizer - This is done to find how important a word in document is in comparison to the corpus
from sklearn.feature_extraction.text import TfidfVectorizer
Tfidf_vect = TfidfVectorizer(max_features=5000)
Tfidf_vect.fit(Corpus['text_final'])
Train_X_Tfidf = Tfidf_vect.transform(Train_X)
Test_X_Tfidf = Tfidf_vect.transform(Test_X)
# Step 5: Now run ML algorithm to classify the text
# Classifier - Algorithm - Naive Bayes
from sklearn import naive_bayes
from sklearn.metrics import accuracy_score
# fit the training dataset on the classifier
Naive = naive_bayes.MultinomialNB()
Naive.fit(Train_X_Tfidf,Train_Y)
# predict the labels on validation dataset
predictions_NB = Naive.predict(Test_X_Tfidf)
# Use accuracy_score function to get the accuracy
print("Naive Bayes Accuracy Score -> ",accuracy_score(predictions_NB, Test_Y)*100)
# Classifier - Algorithm - SVM
from sklearn import svm
# fit the training dataset on the classifier
SVM = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto')
SVM.fit(Train_X_Tfidf,Train_Y)
# predict the labels on validation dataset
predictions_SVM = SVM.predict(Test_X_Tfidf)
# Use accuracy_score function to get the accuracy
print("SVM Accuracy Score -> ",accuracy_score(predictions_SVM, Test_Y)*100)
# Classifier - Algorithm - Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
# fit the training dataset on the classifier
clf = RandomForestClassifier(n_estimators=400, max_depth=20, random_state=0)
clf.fit(Train_X_Tfidf,Train_Y)
# predict the labels on validation dataset
predictions_clf = clf.predict(Test_X_Tfidf)
# Use accuracy_score function to get the accuracy
print("Random Forest Classifier Accuracy Score -> ",accuracy_score(predictions_clf, Test_Y)*100)
# Classifier - Algorithm - AdaBoost Classifier
from sklearn.ensemble import AdaBoostClassifier
# fit the training dataset on the classifier
adaclf = AdaBoostClassifier(n_estimators=800, random_state=0)
adaclf.fit(Train_X_Tfidf,Train_Y)
# predict the labels on validation dataset
predictions_adaclf = adaclf.predict(Test_X_Tfidf)
# Use accuracy_score function to get the accuracy
print("AdaBoost Classifier Accuracy Score -> ",accuracy_score(predictions_adaclf, Test_Y)*100)
# Classifier - Algorithm - Linear SVC
from sklearn.svm import LinearSVC
# fit the training dataset on the classifier
svc = LinearSVC(random_state=0, tol=1e-5)
svc.fit(Train_X_Tfidf,Train_Y)
# predict the labels on validation dataset
predictions_svc = svc.predict(Test_X_Tfidf)
# Use accuracy_score function to get the accuracy
print("Linear SVC Accuracy Score -> ",accuracy_score(predictions_svc, Test_Y)*100)
# Classifier - Algorithm - Logistic Regression
from sklearn.linear_model import LogisticRegression
# fit the training dataset on the classifier
lr = LogisticRegression()
lr.fit(Train_X_Tfidf,Train_Y)
# predict the labels on validation dataset
predictions_lr = lr.predict(Test_X_Tfidf)
# Use accuracy_score function to get the accuracy
print("Logistic Regression Accuracy Score -> ",accuracy_score(predictions_lr, Test_Y)*100)
# Classifier - Algorithm - MLP Classifier
from sklearn.neural_network import MLPClassifier
# fit the training dataset on the classifier
mlp = MLPClassifier(hidden_layer_sizes=(13,13,13),max_iter=500)
mlp.fit(Train_X_Tfidf,Train_Y)
# predict the labels on validation dataset
predictions_nn = mlp.predict(Test_X_Tfidf)
# Use accuracy_score function to get the accuracy
print("MLPClassifier Accuracy Score -> ",accuracy_score(predictions_nn, Test_Y)*100)
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
model_name = "aychang/roberta-base-imdb"
# Sentiment analysis is sequence classification, so load a classification head
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
nlp = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
# nlp only takes lists as input
list_of_text = df.text.tolist()
# To speed things up a little bit grab a subset of the data
results = nlp(list_of_text[0:1000])
# extract the predictions from the results
pred = []
for i in range(1000):
    if results[i]['label'] == 'pos':
        pred.append(1)
    elif results[i]['label'] == 'neg':
        pred.append(0)
# Calculate accuracy score but first transform labels to numbers
df['label'] = Encoder.fit_transform(df['label'])
print("Transferred model Accuracy Score -> ",accuracy_score(pred, df['label'][0:1000])*100)
Besides libraries like Hugging Face, there are many other models whose trained versions you can find online. These can be huge networks that have been trained for months on a full datacenter with the newest GPUs, like GPT-2 and GPT-3, or smaller architectures like ResNet50 or VGG16. While the smaller models are feasible to train from scratch yourself, you might not have enough data, or simply want to reduce training time, cost or environmental impact. These models come with a set of pretrained weights, and you can then transfer-learn on your own dataset, building on the experience of the pretrained model.
In this exercise we will be looking at the YOLO (You Only Look Once) object detection model. While there are more advanced object detection models like Facebook's Detectron2 (R-CNN based) and DETR (Transformer based), these are huge models which still take a long time to transfer-learn. YOLO's fifth version provides state-of-the-art performance in relatively small models (in different size varieties), which makes it very useful for edge deployments and cases where inference speed or compute resources matter.
First we will download the COCO 2017 dataset, one of the most common object detection benchmarking datasets. Then we will download YOLOv5 and get to the exercise.
# Download COCO test-dev2017, will take a couple of minutes
!wget https://ultralytics.com/assets/coco2017labels.zip
!unzip -q coco2017labels.zip -d ../datasets && rm coco2017labels.zip
!f="test2017.zip" && curl http://images.cocodataset.org/zips/$f -o $f && unzip -q $f -d ../datasets/coco/images
# Download yolov5 and install its dependencies
!git clone https://github.com/ultralytics/yolov5 # clone repo
!pip install -qr yolov5/requirements.txt # install dependencies
A) Train a YOLOv5 S variant on the data using the train.py script in the yolov5 folder, with 3 epochs. (The data should already be in the right spot.)
B) Go to the results folder in the yolov5 folder and evaluate the results of the experiment you just ran.
C) Select a few photos (from ../datasets/coco128/images/train2017) and use the detect.py script to detect the objects in them.
D) Choose a larger variant of the YOLOv5 model and repeat training and detection on the same photos. Are the results significantly better, compared to the longer training time?
!python yolov5/train.py --img 640 --batch 16 --epochs 3 --data coco128.yaml --weights yolov5s.pt --cache #--device 0
!python yolov5/detect.py --source ../datasets/coco128/images/train2017/000000000009.jpg --weights yolov5/yolov5s.pt --conf 0.4